Improving annotation propagation on molecular networks through random walks: introducing ChemWalker

Abstract Motivation Annotation of the mass signals is still the biggest bottleneck for the untargeted mass spectrometry analysis of complex mixtures. Molecular networks are being increasingly adopted by the mass spectrometry community as a tool to annotate large-scale experiments. We have previously shown that the process of propagating annotations from spectral library matches on molecular networks can be automated using Network Annotation Propagation (NAP). One of the limitations of NAP is that the information for the spectral matches is only propagated locally, to the first neighbor of a spectral match. Here, we show that annotation propagation can be expanded to nodes not directly connected to spectral matches using random walks on graphs, introducing the ChemWalker python library. Results Similarly to NAP, ChemWalker relies on combinatorial in silico fragmentation results, performed by MetFrag, searching biologically relevant databases. Departing from the combination of a spectral network and the structural similarity among candidate structures, we have used MetFusion Scoring function to create a weight function, producing a weighted graph. This graph was subsequently used by the random walk to calculate the probability of ‘walking’ through a set of candidates, departing from seed nodes (represented by spectral library matches). This approach allowed the information propagation to nodes not directly connected to the spectral library match. Compared with NAP, ChemWalker has a series of improvements, on running time, scalability and maintainability and is available as a standalone python package. Availability and implementation ChemWalker is freely available at https://github.com/computational-chemical-biology/ChemWalker Contact ridasilva@usp.br Supplementary information Supplementary data are available at Bioinformatics online.

nected to spectral matches using random walks on graphs, introducing the ChemWalker python library. Similarly to NAP, ChemWalker relies on combinatorial in silico fragmentation results, performed by MetFrag, searching biologically relevant databases. Departing from the combination of a spectral network and the structural similarity among candidate structures, we have used MetFusion Scoring function to create a weight function, producing a weighted graph. This graph was subsequently used by the random walk to calculate the probability of 'walking' through a set of candidates, departing from seed nodes (represented by spectral library matches). This approach allowed the information propagation to nodes not directly connected to the spectral library match. Compared to NAP, ChemWalker has a series of improvements, on running time, scalability and maintainability and is available as a stand alone python package. ChemWalker is freely available at https://github.com/computational-chemical-biology/ChemWalker.

Model description
The MetFusion [1] model uses spectral similarity to inform in silico fragmentation prediction. The approach re-calculates the MetFrag score using the following formulation where the index c represents each MetFrag candidate, f c the MetFrag [2] score and the spectral similarity is represented by the spectral library match score m j for all j neighbor nodes, multiplied by the chemical similarity t cj between MetFrag candidates c associated to each neighbor node j. The sig represents the sigmoid function.
The Network Annotation Propagation (NAP) model uses the MetFusion concept to propagate annotations from spectral library matches through the network topology [3].
The Figure 1 illustrates the propagation concept. Figure 1: Network Annotation Propagation. The Fusion scoring uses the reference from spectral similarity to evaluate structural similarity to candidate structures and re-rank the candidates. The Consensus scoring uses the similarity among candidates to retrieve sets of similar candidates.

Structural similarity graph
From NAP's model one can derive a graph of structural similarity, following the spectral network topology. The Figure 2 illustrates the replacement of the spectra, by its in silico fragmentation candidates and the generation of a structural similarity graph, where the edges are weighted by a similarity score, discussed below.
The Graph G = (N, E) can be represented by a set of |N | = n nodes, representing the candidate structures, and |E| = e edges representing the structural similarity between two given candidates.  The similarity between candidates of the neighbor nodes are used to create a weighted graph. A sequence of nodes is selected (e.g. from v 0 to v 1 ) and the probability to move to a given node is calculated based on weights. The process is repeated until there is no change in probability (convergence).

Random Walk model
Under the assumption that the spectral similarity observed among the spectra infers the structural similarity of the underlying compound structures, the Random Walk on graphs selects an initial node v 0 , and subsequently an adjacent node v 1 and moves to this neighbor [4]. The walks can be determined by a probabilistic model with of the following form where w (i) = j∈Γ(i) w ij , and i and j represent neighbor nodes in the graph. To maintain the MetFusion score setup, averaging spectral and structural similarities, we applied the following formulation where the indexes i and j represent two connected nodes, f index the MetFrag score for each node, the spectral similarity is represented by cosine score m ij between nodes i and j, multiplied by the chemical similarity t ij between MetFrag candidate structures associated to nodes i and j. The sig represents the sigmoid function. The Markov property holds, where conditional on the present, the future is independent of the past, therefore: The probability of choosing a given i-th candidate structure at a step k of the Random Walk is given by the probability vector p k [5], which at any given step can be obtained by where W is the column-normalized adjacency matrix of the graph, p 0 was constructed such that equal probabilities were assigned to the nodes representing 'seed' spectral library matches and r is the restart probability, which restricts on how far we want the random walker to get away from the start 'seed' node(s). The candidate compounds are ranked according to the values in the steady-state probability vector p ∞ . This state is obtained at query time by performing the iteration until the change between p k and p k+1 (measured by the L1 norm) falls below 10 −6 .

Random Walk algorithm
Following the description above the algorithm to compute the candidate probabilities is given by:

Algorithm 1: Random Walk on Graphs algorithm
Input: row-normalized adjacency matrix W , re-start probability r, and tolerance Output: probability vector p 1 set p 0 with equal probability for library matches 6 return steady-state probability vector p k

Benchmarking ChemWalker
In order to benchmark NAP, 5,467 spectra from NIST library were used to compare the propagation scores to MetFrag used on each spectra individually. To extend the benchmark, a subset of 555 spectra, keeping the proportionality of the NAP's result, as shown in the comparison between the subset and the complete dataset (Figure 3), was selected [3], to compare the previous results with ChemWalker.  First, the molecular fingerprint used for propagation was compared. The RDKit (https://www.rdkit.org/) library was used, and 27 fingerprints were compared. As ChemWalker was designed to work with all spectral library matches as seed nodes, we established 10% of nodes of a given connected component to be the number of seeds in the benchmark dataset (one node for connected components with less than 10 nodes). The fingerprint comparison is shown on Figure 4.    Carefully inspecting some of the best performing fingerprints, we observe in Figure 6, that the difference between the performance improvement of ChemWalker over MetFrag seems compatible to Consensus and Fusion scoring.